11/30/23
Q: Just to confirm, if we choose to do the CS02 do we still do it with our final group or the assigned CS02 group?
A: Final group!
Q: If we choose to do the final project instead of CS2, will everything be the same as our originally planned final projects in terms of contents except the grade weighting?
A: Yup!
Q: Understanding some of the variable’s meaning, maybe would have been good for us to recode them to make them more intuitive.
A: A good suggestion if you decide to go this route for the final!
Q: For the distribution on the diagonal line of the correlation graphs, what does x/y axis represent?
A: It’s a densityplot (shows the distribution) for each individual variable - the one that’s in that row.
Can we predict US annual average air pollution concentrations at the granularity of zip code regional levels using predictors such as data about population density, urbanization, road density, as well as, satellite pollution data and chemical modeling data?
Specify the split:
<Training/Testing/Total>
<584/292/876>
set.seed <- ensures we all get the exact same random split<training data sample number, testing data sample number, original sample number>More on how people decide what proportions to use for data splitting here
Split the Data
train_pm <- rsample::training(pm_split)
test_pm <- rsample::testing(pm_split)
# Scroll through the output!
count(train_pm, state)# A tibble: 49 × 2
state n
<chr> <int>
1 Alabama 13
2 Arizona 12
3 Arkansas 8
4 California 55
5 Colorado 10
6 Connecticut 12
7 Delaware 3
8 District Of Columbia 2
9 Florida 22
10 Georgia 20
# ℹ 39 more rows
recipe() + bake()Need to:
recipe provides a standardized format for a sequence of steps for pre-processing the data
The simplest approach…
…but we need to specify which column includes ID information
…and which are our predictors and which is our outcome
Summarizing our recipe thus far:
# A tibble: 50 × 4
variable type role source
<chr> <list> <chr> <chr>
1 id <chr [3]> id variable original
2 value <chr [2]> outcome original
3 fips <chr [3]> predictor original
4 lat <chr [2]> predictor original
5 lon <chr [2]> predictor original
6 state <chr [3]> predictor original
7 county <chr [3]> predictor original
8 city <chr [3]> predictor original
9 CMAQ <chr [2]> predictor original
10 zcta <chr [3]> predictor original
# ℹ 40 more rows
step*()
There are step functions for a variety of purposes:
This link and this link show the many options for recipe step functions.
There are several ways to select what variables to apply steps to:
tidyselect methods: contains(), matches(), starts_with(), ends_with(), everything(), num_range()all_nominal(), all_numeric() , has_type()all_predictors(), all_outcomes(), has_role()One-hot encoding categorical variables:
fips includes numeric code for state and county, so it’s another way to specify countyfips’ role"county id")Removing highly correlated variables:
CMAQ and aodrecipe togetherNote: order of steps matters
prep)There are some important arguments to know about:
training - you must supply a training data set to estimate parameters for pre-processing operations (recipe steps) - this may already be included in your recipe - as is the case for usfresh - if fresh=TRUE, will retrain and estimate parameters for any previous steps that were already prepped if you add more steps to the recipe (default is FALSE)verbose - if verbose=TRUE, shows the progress as the steps are evaluated and the size of the pre-processed training set (default is FALSE)retain - if retain=TRUE, then the pre-processed training set will be saved within the recipe (as template). This is good if you are likely to add more steps and do not want to rerun the prep() on the previous steps. However this can make the recipe size large. This is necessary if you want to actually look at the pre-processed data (default is TRUE)oper 1 step dummy [training]
oper 2 step corr [training]
oper 3 step nzv [training]
The retained training set is ~ 0.26 Mb in memory.
[1] "var_info" "term_info" "steps" "template"
[5] "levels" "retained" "requirements" "tr_info"
[9] "orig_lvls" "last_term_info"
This output includes a lot of information:
steps that were runvar_info)term_info)levels of the variablesorig_lvls)tr_info)bake()
bake(): apply our modeling steps (in this case just pre-processing on the training data) and see what it would do the data
Rows: 584
Columns: 37
$ id <fct> 18003.0004, 55041.0007, 6065.1003, 39009.0…
$ value <dbl> 11.699065, 6.956780, 13.289744, 10.742000,…
$ fips <fct> 18003, 55041, 6065, 39009, 39061, 24510, 6…
$ lat <dbl> 41.09497, 45.56300, 33.94603, 39.44217, 39…
$ lon <dbl> -85.10182, -88.80880, -117.40063, -81.9088…
$ CMAQ <dbl> 10.383231, 3.411247, 11.404085, 7.971165, …
$ zcta_area <dbl> 16696709, 370280916, 41957182, 132383592, …
$ zcta_pop <dbl> 21306, 4141, 44001, 1115, 6566, 934, 41192…
$ imp_a500 <dbl> 28.9783737, 0.0000000, 30.3901384, 0.00000…
$ imp_a15000 <dbl> 13.0547959, 0.3676404, 23.7457506, 0.33079…
$ county_area <dbl> 1702419942, 2626421270, 18664696661, 13043…
$ county_pop <dbl> 355329, 9304, 2189641, 64757, 802374, 6209…
$ log_dist_to_prisec <dbl> 6.621891, 8.415468, 7.419762, 6.344681, 5.…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 10.150514, 8.517193, 9…
$ log_pri_length_25000 <dbl> 12.77378, 10.16440, 13.14450, 10.12663, 13…
$ log_prisec_length_500 <dbl> 6.214608, 6.214608, 6.214608, 6.214608, 7.…
$ log_prisec_length_1000 <dbl> 9.240294, 7.600902, 7.600902, 8.793450, 8.…
$ log_prisec_length_5000 <dbl> 11.485093, 9.425537, 10.155961, 10.562382,…
$ log_prisec_length_10000 <dbl> 12.75582, 11.44833, 11.59563, 11.69093, 12…
$ log_nei_2008_pm10_sum_10000 <dbl> 4.91110140, 3.86982666, 4.03184660, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 5.399131, 3.883689, 5.459257, 0.000000, 6.…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.816047, 3.887264, 6.884537, 3.765635, 6.…
$ popdens_county <dbl> 208.719947, 3.542463, 117.314577, 49.64834…
$ popdens_zcta <dbl> 1276.059851, 11.183401, 1048.711994, 8.422…
$ nohs <dbl> 4.3, 5.1, 3.7, 4.8, 2.1, 0.0, 2.5, 7.7, 0.…
$ somehs <dbl> 6.7, 10.4, 5.9, 11.5, 10.5, 0.0, 4.3, 7.5,…
$ hs <dbl> 31.7, 40.3, 17.9, 47.3, 30.0, 0.0, 17.8, 2…
$ somecollege <dbl> 27.2, 24.1, 26.3, 20.0, 27.1, 0.0, 26.1, 2…
$ associate <dbl> 8.2, 7.4, 8.3, 3.1, 8.5, 71.4, 13.2, 7.6, …
$ bachelor <dbl> 15.0, 8.6, 20.2, 9.8, 14.2, 0.0, 23.4, 17.…
$ grad <dbl> 6.8, 4.2, 17.7, 3.5, 7.6, 28.6, 12.6, 12.3…
$ pov <dbl> 13.500, 18.900, 6.700, 14.400, 12.500, 3.5…
$ hs_orless <dbl> 42.7, 55.8, 27.5, 63.6, 42.6, 0.0, 24.6, 3…
$ urc2006 <dbl> 3, 6, 1, 5, 1, 1, 2, 1, 2, 6, 4, 4, 4, 4, …
$ aod <dbl> 54.11111, 31.16667, 83.12500, 33.36364, 50…
$ state_California <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, …
$ city_Not.in.a.city <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
new_data = NULL specifies that we’re not (yet) looking at our testing datastate) are gone (one-hot encoding)state_California remains - only state with nonzero variance (largest # of monitors)bake()
bake()takes a trained recipe and applies the operations to a data set to create a design matrix. For example: it applies the centering to new data sets using these means used to create the recipe. -tidymodelsdocumentation
Typically, you want to avoid using your testing data…but our data set is not that large and NA values in our testing dataset could cause issues later on.
Rows: 292
Columns: 37
$ id <fct> 1033.1002, 1055.001, 1069.0003, 1073.0023,…
$ value <dbl> 11.212174, 12.375394, 10.508850, 15.591017…
$ fips <fct> 1033, 1055, 1069, 1073, 1073, 1073, 1073, …
$ lat <dbl> 34.75878, 33.99375, 31.22636, 33.55306, 33…
$ lon <dbl> -87.65056, -85.99107, -85.39077, -86.81500…
$ CMAQ <dbl> 9.402679, 9.241744, 9.121892, 10.235612, 1…
$ zcta_area <dbl> 16716984, 154069359, 162685124, 26929603, …
$ zcta_pop <dbl> 9042, 20045, 30217, 9010, 16140, 3699, 137…
$ imp_a500 <dbl> 19.17301038, 16.49307958, 19.13927336, 41.…
$ imp_a15000 <dbl> 5.2472094, 5.1612102, 4.7401296, 17.452484…
$ county_area <dbl> 1534877333, 1385618994, 1501737720, 287819…
$ county_pop <dbl> 54428, 104430, 101547, 658466, 658466, 194…
$ log_dist_to_prisec <dbl> 5.760131, 5.261457, 7.112373, 6.600958, 6.…
$ log_pri_length_5000 <dbl> 8.517193, 9.066563, 8.517193, 11.156977, 1…
$ log_pri_length_25000 <dbl> 10.15769, 12.01356, 10.12663, 12.98762, 12…
$ log_prisec_length_500 <dbl> 8.611945, 8.740680, 6.214608, 6.214608, 6.…
$ log_prisec_length_1000 <dbl> 9.735569, 9.627898, 7.600902, 9.075921, 8.…
$ log_prisec_length_5000 <dbl> 11.770407, 11.728889, 12.298627, 12.281645…
$ log_prisec_length_10000 <dbl> 12.840663, 12.768279, 12.994141, 13.278416…
$ log_nei_2008_pm10_sum_10000 <dbl> 6.69187313, 4.43719884, 0.92888890, 8.2097…
$ log_nei_2008_pm10_sum_15000 <dbl> 6.70127741, 4.46267932, 3.67473904, 8.6488…
$ log_nei_2008_pm10_sum_25000 <dbl> 7.148858, 4.678311, 3.744629, 8.858019, 8.…
$ popdens_county <dbl> 35.460814, 75.367038, 67.619664, 228.77763…
$ popdens_zcta <dbl> 540.8870404, 130.1037411, 185.7391706, 334…
$ nohs <dbl> 7.3, 4.3, 5.8, 7.1, 2.7, 11.1, 9.7, 3.0, 8…
$ somehs <dbl> 15.8, 13.3, 11.6, 17.1, 6.6, 11.6, 21.6, 1…
$ hs <dbl> 30.6, 27.8, 29.8, 37.2, 30.7, 46.0, 39.3, …
$ somecollege <dbl> 20.9, 29.2, 21.4, 23.5, 25.7, 17.2, 21.6, …
$ associate <dbl> 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, 5.2, 6.6, 4…
$ bachelor <dbl> 12.7, 10.0, 13.7, 5.9, 17.6, 7.1, 2.2, 7.8…
$ grad <dbl> 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, 0.4, 4.2, 3.…
$ pov <dbl> 19.0, 8.8, 15.6, 25.5, 7.3, 8.1, 13.3, 23.…
$ hs_orless <dbl> 53.7, 45.4, 47.2, 61.4, 40.0, 68.7, 70.6, …
$ urc2006 <dbl> 4, 4, 4, 1, 1, 1, 2, 3, 3, 3, 2, 5, 4, 1, …
$ aod <dbl> 36.000000, 43.416667, 33.000000, 39.583333…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ city_Not.in.a.city <dbl> NA, NA, NA, 0, 1, 1, 1, NA, NA, NA, 0, NA,…
Hmm….lots of NAs now in city_Not.in.a.city
Likely b/c there are cities in our testing dataaset that were not in our training dataset…
A quick return to wrangling…and re-splitting our data
pm <- pm %>%
mutate(city = case_when(city == "Not in a city" ~ "Not in a city",
city != "Not in a city" ~ "In a city"))
set.seed(1234) # same seed as before
pm_split <-rsample::initial_split(data = pm, prop = 2/3)
pm_split<Training/Testing/Total>
<584/292/876>
And a recipe update…(putting it all together)
novel_rec <- recipe(train_pm) %>%
update_role(everything(), new_role = "predictor") %>%
update_role(value, new_role = "outcome") %>%
update_role(id, new_role = "id variable") %>%
update_role("fips", new_role = "county id") %>%
step_dummy(state, county, city, zcta, one_hot = TRUE) %>%
step_corr(all_numeric()) %>%
step_nzv(all_numeric()) re-bake()
Looking at the output
Rows: 584
Columns: 38
$ id <fct> 18003.0004, 55041.0007, 6065.1003, 39009.0…
$ value <dbl> 11.699065, 6.956780, 13.289744, 10.742000,…
$ fips <fct> 18003, 55041, 6065, 39009, 39061, 24510, 6…
$ lat <dbl> 41.09497, 45.56300, 33.94603, 39.44217, 39…
$ lon <dbl> -85.10182, -88.80880, -117.40063, -81.9088…
$ CMAQ <dbl> 10.383231, 3.411247, 11.404085, 7.971165, …
$ zcta_area <dbl> 16696709, 370280916, 41957182, 132383592, …
$ zcta_pop <dbl> 21306, 4141, 44001, 1115, 6566, 934, 41192…
$ imp_a500 <dbl> 28.9783737, 0.0000000, 30.3901384, 0.00000…
$ imp_a15000 <dbl> 13.0547959, 0.3676404, 23.7457506, 0.33079…
$ county_area <dbl> 1702419942, 2626421270, 18664696661, 13043…
$ county_pop <dbl> 355329, 9304, 2189641, 64757, 802374, 6209…
$ log_dist_to_prisec <dbl> 6.621891, 8.415468, 7.419762, 6.344681, 5.…
$ log_pri_length_5000 <dbl> 8.517193, 8.517193, 10.150514, 8.517193, 9…
$ log_pri_length_25000 <dbl> 12.77378, 10.16440, 13.14450, 10.12663, 13…
$ log_prisec_length_500 <dbl> 6.214608, 6.214608, 6.214608, 6.214608, 7.…
$ log_prisec_length_1000 <dbl> 9.240294, 7.600902, 7.600902, 8.793450, 8.…
$ log_prisec_length_5000 <dbl> 11.485093, 9.425537, 10.155961, 10.562382,…
$ log_prisec_length_10000 <dbl> 12.75582, 11.44833, 11.59563, 11.69093, 12…
$ log_prisec_length_25000 <dbl> 13.98749, 13.15082, 13.44293, 13.58697, 14…
$ log_nei_2008_pm10_sum_10000 <dbl> 4.91110140, 3.86982666, 4.03184660, 0.0000…
$ log_nei_2008_pm10_sum_15000 <dbl> 5.399131, 3.883689, 5.459257, 0.000000, 6.…
$ log_nei_2008_pm10_sum_25000 <dbl> 5.816047, 3.887264, 6.884537, 3.765635, 6.…
$ popdens_county <dbl> 208.719947, 3.542463, 117.314577, 49.64834…
$ popdens_zcta <dbl> 1276.059851, 11.183401, 1048.711994, 8.422…
$ nohs <dbl> 4.3, 5.1, 3.7, 4.8, 2.1, 0.0, 2.5, 7.7, 0.…
$ somehs <dbl> 6.7, 10.4, 5.9, 11.5, 10.5, 0.0, 4.3, 7.5,…
$ hs <dbl> 31.7, 40.3, 17.9, 47.3, 30.0, 0.0, 17.8, 2…
$ somecollege <dbl> 27.2, 24.1, 26.3, 20.0, 27.1, 0.0, 26.1, 2…
$ associate <dbl> 8.2, 7.4, 8.3, 3.1, 8.5, 71.4, 13.2, 7.6, …
$ bachelor <dbl> 15.0, 8.6, 20.2, 9.8, 14.2, 0.0, 23.4, 17.…
$ grad <dbl> 6.8, 4.2, 17.7, 3.5, 7.6, 28.6, 12.6, 12.3…
$ pov <dbl> 13.500, 18.900, 6.700, 14.400, 12.500, 3.5…
$ hs_orless <dbl> 42.7, 55.8, 27.5, 63.6, 42.6, 0.0, 24.6, 3…
$ urc2006 <dbl> 3, 6, 1, 5, 1, 1, 2, 1, 2, 6, 4, 4, 4, 4, …
$ aod <dbl> 54.11111, 31.16667, 83.12500, 33.36364, 50…
$ state_California <dbl> 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, …
$ city_Not.in.a.city <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
Making sure the NA issue is taken are of:
Rows: 292
Columns: 38
$ id <fct> 1033.1002, 1055.001, 1069.0003, 1073.0023,…
$ value <dbl> 11.212174, 12.375394, 10.508850, 15.591017…
$ fips <fct> 1033, 1055, 1069, 1073, 1073, 1073, 1073, …
$ lat <dbl> 34.75878, 33.99375, 31.22636, 33.55306, 33…
$ lon <dbl> -87.65056, -85.99107, -85.39077, -86.81500…
$ CMAQ <dbl> 9.402679, 9.241744, 9.121892, 10.235612, 1…
$ zcta_area <dbl> 16716984, 154069359, 162685124, 26929603, …
$ zcta_pop <dbl> 9042, 20045, 30217, 9010, 16140, 3699, 137…
$ imp_a500 <dbl> 19.17301038, 16.49307958, 19.13927336, 41.…
$ imp_a15000 <dbl> 5.2472094, 5.1612102, 4.7401296, 17.452484…
$ county_area <dbl> 1534877333, 1385618994, 1501737720, 287819…
$ county_pop <dbl> 54428, 104430, 101547, 658466, 658466, 194…
$ log_dist_to_prisec <dbl> 5.760131, 5.261457, 7.112373, 6.600958, 6.…
$ log_pri_length_5000 <dbl> 8.517193, 9.066563, 8.517193, 11.156977, 1…
$ log_pri_length_25000 <dbl> 10.15769, 12.01356, 10.12663, 12.98762, 12…
$ log_prisec_length_500 <dbl> 8.611945, 8.740680, 6.214608, 6.214608, 6.…
$ log_prisec_length_1000 <dbl> 9.735569, 9.627898, 7.600902, 9.075921, 8.…
$ log_prisec_length_5000 <dbl> 11.770407, 11.728889, 12.298627, 12.281645…
$ log_prisec_length_10000 <dbl> 12.840663, 12.768279, 12.994141, 13.278416…
$ log_prisec_length_25000 <dbl> 13.79973, 13.70026, 13.85550, 14.45221, 13…
$ log_nei_2008_pm10_sum_10000 <dbl> 6.69187313, 4.43719884, 0.92888890, 8.2097…
$ log_nei_2008_pm10_sum_15000 <dbl> 6.70127741, 4.46267932, 3.67473904, 8.6488…
$ log_nei_2008_pm10_sum_25000 <dbl> 7.148858, 4.678311, 3.744629, 8.858019, 8.…
$ popdens_county <dbl> 35.460814, 75.367038, 67.619664, 228.77763…
$ popdens_zcta <dbl> 540.8870404, 130.1037411, 185.7391706, 334…
$ nohs <dbl> 7.3, 4.3, 5.8, 7.1, 2.7, 11.1, 9.7, 3.0, 8…
$ somehs <dbl> 15.8, 13.3, 11.6, 17.1, 6.6, 11.6, 21.6, 1…
$ hs <dbl> 30.6, 27.8, 29.8, 37.2, 30.7, 46.0, 39.3, …
$ somecollege <dbl> 20.9, 29.2, 21.4, 23.5, 25.7, 17.2, 21.6, …
$ associate <dbl> 7.6, 10.1, 7.9, 7.3, 8.0, 4.1, 5.2, 6.6, 4…
$ bachelor <dbl> 12.7, 10.0, 13.7, 5.9, 17.6, 7.1, 2.2, 7.8…
$ grad <dbl> 5.1, 5.4, 9.8, 2.0, 8.7, 2.9, 0.4, 4.2, 3.…
$ pov <dbl> 19.0, 8.8, 15.6, 25.5, 7.3, 8.1, 13.3, 23.…
$ hs_orless <dbl> 53.7, 45.4, 47.2, 61.4, 40.0, 68.7, 70.6, …
$ urc2006 <dbl> 4, 4, 4, 1, 1, 1, 2, 3, 3, 3, 2, 5, 4, 1, …
$ aod <dbl> 36.000000, 43.416667, 33.000000, 39.583333…
$ state_California <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ city_Not.in.a.city <dbl> 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
parsnip)There are four things we need to define about our model:
rand_forest(), logistic_reg() etc.)set_engine() function)set_mode() function)set_args()function - for example the mtry = argument for random forest which is the number of variables to be used as options for splitting at each tree node)parsnip.workflows package allows us to keep track of both our pre-processing steps and our model specificationPM_wflow <- workflows::workflow() %>%
workflows::add_recipe(novel_rec) %>%
workflows::add_model(lm_PM_model)
PM_wflow══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_dummy()
• step_corr()
• step_nzv()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_dummy()
• step_corr()
• step_nzv()
── Model ───────────────────────────────────────────────────────────────────────
Call:
stats::lm(formula = ..y ~ ., data = data)
Coefficients:
(Intercept) lat
2.936e+02 3.261e-02
lon CMAQ
1.586e-02 2.463e-01
zcta_area zcta_pop
-3.433e-10 1.013e-05
imp_a500 imp_a15000
5.064e-03 -3.066e-03
county_area county_pop
-2.324e-11 -7.576e-08
log_dist_to_prisec log_pri_length_5000
6.214e-02 -2.006e-01
log_pri_length_25000 log_prisec_length_500
-5.411e-02 2.204e-01
log_prisec_length_1000 log_prisec_length_5000
1.154e-01 2.374e-01
log_prisec_length_10000 log_prisec_length_25000
-3.436e-02 5.224e-01
log_nei_2008_pm10_sum_10000 log_nei_2008_pm10_sum_15000
1.829e-01 -2.355e-02
log_nei_2008_pm10_sum_25000 popdens_county
2.403e-02 2.203e-05
popdens_zcta nohs
-2.132e-06 -2.983e+00
somehs hs
-2.956e+00 -2.962e+00
somecollege associate
-2.967e+00 -2.999e+00
bachelor grad
-2.979e+00 -2.978e+00
pov hs_orless
1.859e-03 NA
urc2006 aod
2.577e-01 1.535e-02
state_California city_Not.in.a.city
3.114e+00 -4.250e-02
# A tibble: 36 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.94e+ 2 1.18e+ 2 2.49 0.0130
2 lat 3.26e- 2 2.28e- 2 1.43 0.153
3 lon 1.59e- 2 1.01e- 2 1.58 0.115
4 CMAQ 2.46e- 1 3.97e- 2 6.20 0.00000000108
5 zcta_area -3.43e-10 1.60e-10 -2.15 0.0320
6 zcta_pop 1.01e- 5 5.33e- 6 1.90 0.0578
7 imp_a500 5.06e- 3 7.42e- 3 0.683 0.495
8 imp_a15000 -3.07e- 3 1.16e- 2 -0.263 0.792
9 county_area -2.32e-11 1.97e-11 -1.18 0.238
10 county_pop -7.58e- 8 9.29e- 8 -0.815 0.415
# ℹ 26 more rows
Understanding what variables are most important in our model…
A closer look at monitors in CA:
Remember: machine learning (ML) as an optimization problem that tries to minimize the distance between our predicted outcome \(\hat{Y} = f(X)\) and actual outcome \(Y\) using our features (or predictor variables) \(X\) as input to a function \(f\) that we want to estimate.
\[d(Y - \hat{Y})\]
Let’s pull out our predicted outcome values \(\hat{Y} = f(X)\) from the models we fit (using different approaches).
wf_fit <- PM_wflow_fit %>%
extract_fit_parsnip()
wf_fitted_values <-
broom::augment(wf_fit[["fit"]], data = baked_train) %>%
select(value, .fitted:.std.resid)
head(wf_fitted_values)# A tibble: 6 × 6
value .fitted .hat .sigma .cooksd .std.resid
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 11.7 12.2 0.0370 2.05 0.0000648 -0.243
2 6.96 9.14 0.0496 2.05 0.00179 -1.09
3 13.3 12.6 0.0484 2.05 0.000151 0.322
4 10.7 10.4 0.0502 2.05 0.0000504 0.183
5 14.5 11.9 0.0243 2.05 0.00113 1.26
6 12.2 9.52 0.476 2.04 0.0850 1.81
❓ What do you notice about/learn from these results?
\[RMSE = \sqrt{\frac{\sum_{i=1}^{n}{(\hat{y_t}- y_t)}^2}{n}}\]
Can use the yardstick package using the rmse()` function to calculate:
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 rmse standard 1.98
2 rsq standard 0.392
3 mae standard 1.47
Resampling + Re-partitioning:
Preparing the data for cross-validation:
Note: this is called v-fold or k-fold CV
rsample()# 4-fold cross-validation
# A tibble: 4 × 2
splits id
<list> <chr>
1 <split [438/146]> Fold1
2 <split [438/146]> Fold2
3 <split [438/146]> Fold3
4 <split [438/146]> Fold4
Visualizing this process:
Where this workflow thing really shines…
Fitting a different model
Based on a decision tree:
But…in the case of random forest:
recipe()RF_rec <- recipe(train_pm) %>%
update_role(everything(), new_role = "predictor")%>%
update_role(value, new_role = "outcome")%>%
update_role(id, new_role = "id variable") %>%
update_role("fips", new_role = "county id") %>%
step_novel("state") %>%
step_string2factor("state", "county", "city") %>%
step_rm("county") %>%
step_rm("zcta") %>%
step_corr(all_numeric())%>%
step_nzv(all_numeric())step_novel()necessary here for the state variable to get all cross validation folds to work, (b/c there will be different levels included in each fold test and training sets. The new levels for some of the test sets would otherwise result in an error.; “step_novel creates a specification of a recipe step that will assign a previously unseen factor level to a new value.”Model parameters:
mtry - The number of predictor variables (or features) that will be randomly sampled at each split when creating the tree models. The default number for regression analyses is the number of predictors divided by 3.min_n - The minimum number of data points in a node that are required for the node to be split further.trees - the number of trees in the ensembleRF_wflow <- workflows::workflow() %>%
workflows::add_recipe(RF_rec) %>%
workflows::add_model(RF_PM_model)
RF_wflow══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps
• step_novel()
• step_string2factor()
• step_rm()
• step_rm()
• step_corr()
• step_nzv()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = 10
min_n = 3
Computational engine: randomForest
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps
• step_novel()
• step_string2factor()
• step_rm()
• step_rm()
• step_corr()
• step_nzv()
── Model ───────────────────────────────────────────────────────────────────────
Call:
randomForest(x = maybe_data_frame(x), y = y, mtry = min_cols(~10, x), nodesize = min_rows(~3, x))
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 2.633639
% Var explained: 59.29
set.seed(456)
resample_RF_fit <- tune::fit_resamples(RF_wflow, vfold_pm)
collect_metrics(resample_RF_fit)# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 1.67 4 0.101 Preprocessor1_Model1
2 rsq standard 0.591 4 0.0514 Preprocessor1_Model1
Hyperparameters are often things that we need to specify about a model. Instead of arbitrarily specifying this, we can try to determine the best option for model performance by a process called tuning.
Rather than specifying values, we can use tune():
Create Workflow:
RF_tune_wflow <- workflows::workflow() %>%
workflows::add_recipe(RF_rec) %>%
workflows::add_model(tune_RF_model)
RF_tune_wflow══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps
• step_novel()
• step_string2factor()
• step_rm()
• step_rm()
• step_corr()
• step_nzv()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)
Main Arguments:
mtry = tune()
min_n = tune()
Computational engine: randomForest
Detect how many cores you have access to:
This code will take some time to run:
# install.packages("doParallel")
doParallel::registerDoParallel(cores = n_cores)
set.seed(123)
tune_RF_results <- tune_grid(object = RF_tune_wflow, resamples = vfold_pm, grid = 20)
tune_RF_results# Tuning results
# 4-fold cross-validation
# A tibble: 4 × 4
splits id .metrics .notes
<list> <chr> <list> <list>
1 <split [438/146]> Fold1 <tibble [40 × 6]> <tibble [0 × 3]>
2 <split [438/146]> Fold2 <tibble [40 × 6]> <tibble [0 × 3]>
3 <split [438/146]> Fold3 <tibble [40 × 6]> <tibble [1 × 3]>
4 <split [438/146]> Fold4 <tibble [40 × 6]> <tibble [0 × 3]>
There were issues with some computations:
- Warning(s) x1: 36 columns were requested but there were 35 predictors in the dat...
Run `show_notes(.Last.tune.result)` for more information.
# A tibble: 40 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 12 33 rmse standard 1.72 4 0.0866 Preprocessor1_Model01
2 12 33 rsq standard 0.562 4 0.0466 Preprocessor1_Model01
3 27 35 rmse standard 1.69 4 0.102 Preprocessor1_Model02
4 27 35 rsq standard 0.563 4 0.0511 Preprocessor1_Model02
5 22 40 rmse standard 1.71 4 0.106 Preprocessor1_Model03
6 22 40 rsq standard 0.556 4 0.0543 Preprocessor1_Model03
7 1 27 rmse standard 2.03 4 0.0501 Preprocessor1_Model04
8 1 27 rsq standard 0.440 4 0.0245 Preprocessor1_Model04
9 6 32 rmse standard 1.77 4 0.0756 Preprocessor1_Model05
10 6 32 rsq standard 0.552 4 0.0435 Preprocessor1_Model05
# ℹ 30 more rows
# A tibble: 1 × 3
mtry min_n .config
<int> <int> <chr>
1 32 11 Preprocessor1_Model10
The testing data!
# specify best combination from tune in workflow
RF_tuned_wflow <-RF_tune_wflow %>%
tune::finalize_workflow(tuned_RF_values)
# fit model with those parameters on train AND test
overallfit <- RF_wflow %>%
tune::last_fit(pm_split)
collect_metrics(overallfit)# A tibble: 2 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 rmse standard 1.72 Preprocessor1_Model1
2 rsq standard 0.608 Preprocessor1_Model1
Results are similar to what we saw in training (RMSE: 1.65)
Packages needed:
sf - the simple features package helps to convert geographical coordinates into geometry variables which are useful for making 2D plotsmaps - this package contains geographical outlines and plotting functions to create plots with mapsrnaturalearth- this allows for easy interaction with map data from Natural Earth which is a public domain map datasetRows: 241
Columns: 64
$ scalerank <int> 3, 1, 1, 1, 1, 3, 3, 1, 1, 1, 3, 1, 5, 3, 1, 1, 1, 1, 1, 1,…
$ featurecla <chr> "Admin-0 country", "Admin-0 country", "Admin-0 country", "A…
$ labelrank <dbl> 5, 3, 3, 6, 6, 6, 6, 4, 2, 6, 4, 4, 5, 6, 6, 2, 4, 5, 6, 2,…
$ sovereignt <chr> "Netherlands", "Afghanistan", "Angola", "United Kingdom", "…
$ sov_a3 <chr> "NL1", "AFG", "AGO", "GB1", "ALB", "FI1", "AND", "ARE", "AR…
$ adm0_dif <dbl> 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0,…
$ level <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
$ type <chr> "Country", "Sovereign country", "Sovereign country", "Depen…
$ admin <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ adm0_a3 <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ geou_dif <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ geounit <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ gu_a3 <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ su_dif <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ subunit <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ su_a3 <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ brk_diff <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ name <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ name_long <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ brk_a3 <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ brk_name <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ brk_group <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ abbrev <chr> "Aruba", "Afg.", "Ang.", "Ang.", "Alb.", "Aland", "And.", "…
$ postal <chr> "AW", "AF", "AO", "AI", "AL", "AI", "AND", "AE", "AR", "ARM…
$ formal_en <chr> "Aruba", "Islamic State of Afghanistan", "People's Republic…
$ formal_fr <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ note_adm0 <chr> "Neth.", NA, NA, "U.K.", NA, "Fin.", NA, NA, NA, NA, "U.S.A…
$ note_brk <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "Multiple claim…
$ name_sort <chr> "Aruba", "Afghanistan", "Angola", "Anguilla", "Albania", "A…
$ name_alt <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ mapcolor7 <dbl> 4, 5, 3, 6, 1, 4, 1, 2, 3, 3, 4, 4, 1, 7, 2, 1, 3, 1, 2, 3,…
$ mapcolor8 <dbl> 2, 6, 2, 6, 4, 1, 4, 1, 1, 1, 5, 5, 2, 5, 2, 2, 1, 6, 2, 2,…
$ mapcolor9 <dbl> 2, 8, 6, 6, 1, 4, 1, 3, 3, 2, 1, 1, 2, 9, 5, 2, 3, 5, 5, 1,…
$ mapcolor13 <dbl> 9, 7, 1, 3, 6, 6, 8, 3, 13, 10, 1, NA, 7, 11, 5, 7, 4, 8, 8…
$ pop_est <dbl> 103065, 28400000, 12799293, 14436, 3639453, 27153, 83888, 4…
$ gdp_md_est <dbl> 2258.0, 22270.0, 110300.0, 108.9, 21810.0, 1563.0, 3660.0, …
$ pop_year <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ lastcensus <dbl> 2010, 1979, 1970, NA, 2001, NA, 1989, 2010, 2010, 2001, 201…
$ gdp_year <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ economy <chr> "6. Developing region", "7. Least developed region", "7. Le…
$ income_grp <chr> "2. High income: nonOECD", "5. Low income", "3. Upper middl…
$ wikipedia <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ fips_10 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ iso_a2 <chr> "AW", "AF", "AO", "AI", "AL", "AX", "AD", "AE", "AR", "AM",…
$ iso_a3 <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALA", "AND", "ARE", "AR…
$ iso_n3 <chr> "533", "004", "024", "660", "008", "248", "020", "784", "03…
$ un_a3 <chr> "533", "004", "024", "660", "008", "248", "020", "784", "03…
$ wb_a2 <chr> "AW", "AF", "AO", NA, "AL", NA, "AD", "AE", "AR", "AM", "AS…
$ wb_a3 <chr> "ABW", "AFG", "AGO", NA, "ALB", NA, "ADO", "ARE", "ARG", "A…
$ woe_id <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ adm0_a3_is <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALA", "AND", "ARE", "AR…
$ adm0_a3_us <chr> "ABW", "AFG", "AGO", "AIA", "ALB", "ALD", "AND", "ARE", "AR…
$ adm0_a3_un <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ adm0_a3_wb <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ continent <chr> "North America", "Asia", "Africa", "North America", "Europe…
$ region_un <chr> "Americas", "Asia", "Africa", "Americas", "Europe", "Europe…
$ subregion <chr> "Caribbean", "Southern Asia", "Middle Africa", "Caribbean",…
$ region_wb <chr> "Latin America & Caribbean", "South Asia", "Sub-Saharan Afr…
$ name_len <dbl> 5, 11, 6, 8, 7, 5, 7, 20, 9, 7, 14, 10, 23, 22, 17, 9, 7, 1…
$ long_len <dbl> 5, 11, 6, 8, 7, 13, 7, 20, 9, 7, 14, 10, 27, 35, 19, 9, 7, …
$ abbrev_len <dbl> 5, 4, 4, 4, 4, 5, 4, 6, 4, 4, 9, 4, 7, 10, 6, 4, 5, 4, 4, 5…
$ tiny <dbl> 4, NA, NA, NA, NA, 5, 5, NA, NA, NA, 3, NA, NA, 2, 4, NA, N…
$ homepart <dbl> NA, 1, 1, NA, 1, NA, 1, 1, 1, 1, NA, 1, NA, NA, 1, 1, 1, 1,…
$ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-69.89912 1..., MULTIPOLYGON (…
Just the US
According to this link, these are the latitude and longitude bounds of the continental US:
Adding in our monitors…
Adding in county lines
Simple feature collection with 3076 features and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -124.6813 ymin: 25.12993 xmax: -67.00742 ymax: 49.38323
Geodetic CRS: +proj=longlat +ellps=clrk66 +no_defs +type=crs
First 10 features:
ID geom
alabama,autauga alabama,autauga MULTIPOLYGON (((-86.50517 3...
alabama,baldwin alabama,baldwin MULTIPOLYGON (((-87.93757 3...
alabama,barbour alabama,barbour MULTIPOLYGON (((-85.42801 3...
alabama,bibb alabama,bibb MULTIPOLYGON (((-87.02083 3...
alabama,blount alabama,blount MULTIPOLYGON (((-86.9578 33...
alabama,bullock alabama,bullock MULTIPOLYGON (((-85.66866 3...
alabama,butler alabama,butler MULTIPOLYGON (((-86.8604 31...
alabama,calhoun alabama,calhoun MULTIPOLYGON (((-85.74313 3...
alabama,chambers alabama,chambers MULTIPOLYGON (((-85.59416 3...
alabama,cherokee alabama,cherokee MULTIPOLYGON (((-85.46812 3...
And now onto the map…
monitors <- ggplot(data = world) +
geom_sf(data = counties, fill = NA, color = gray(.5))+
coord_sf(xlim = c(-125, -66), ylim = c(24.5, 50),
expand = FALSE) +
geom_point(data = pm, aes(x = lon, y = lat), size = 2,
shape = 23, fill = "darkred") +
ggtitle("Monitor Locations") +
theme(axis.title.x=element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
monitorsWrangle counties:
truth <- ggplot(data = world) +
coord_sf(xlim = c(-125,-66),
ylim = c(24.5, 50),
expand = FALSE) +
geom_sf(data = map_data, aes(fill = value)) +
scale_fill_gradientn(colours = topo.colors(7),
na.value = "transparent",
breaks = c(0, 10, 20),
labels = c(0, 10, 20),
limits = c(0, 23.5),
name = "PM ug/m3") +
ggtitle("True PM 2.5 levels") +
theme(axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())# fit data
RF_final_train_fit <- parsnip::fit(RF_tuned_wflow, data = train_pm)
RF_final_test_fit <- parsnip::fit(RF_tuned_wflow, data = test_pm)
# get predictions on training data
values_pred_train <- predict(RF_final_train_fit, train_pm) %>%
bind_cols(train_pm %>% select(value, fips, county, id))
# get predictions on testing data
values_pred_test <- predict(RF_final_test_fit, test_pm) %>%
bind_cols(test_pm %>% select(value, fips, county, id))
values_pred_test# A tibble: 292 × 5
.pred value fips county id
<dbl> <dbl> <fct> <chr> <fct>
1 11.6 11.2 1033 Colbert 1033.1002
2 11.9 12.4 1055 Etowah 1055.001
3 11.1 10.5 1069 Houston 1069.0003
4 13.9 15.6 1073 Jefferson 1073.0023
5 12.0 12.4 1073 Jefferson 1073.1005
6 11.3 11.1 1073 Jefferson 1073.1009
7 11.5 11.8 1073 Jefferson 1073.5003
8 11.0 10.0 1097 Mobile 1097.0003
9 11.9 12.0 1101 Montgomery 1101.0007
10 12.9 13.2 1113 Russell 1113.0001
# ℹ 282 more rows
map_data <- inner_join(counties, all_pred, by = "county")
pred <- ggplot(data = world) +
coord_sf(xlim = c(-125,-66),
ylim = c(24.5, 50),
expand = FALSE) +
geom_sf(data = map_data, aes(fill = .pred)) +
scale_fill_gradientn(colours = topo.colors(7),
na.value = "transparent",
breaks = c(0, 10, 20),
labels = c(0, 10, 20),
limits = c(0, 23.5),
name = "PM ug/m3") +
ggtitle("Predicted PM 2.5 levels") +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())library(patchwork)
(truth/pred) +
plot_annotation(title = "Machine Learning Methods Allow for Prediction of Air Pollution", subtitle = "A random forest model predicts true monitored levels of fine particulate matter (PM 2.5) air pollution based on\ndata about population density and other predictors reasonably well, thus suggesting that we can use similar methods to predict levels\nof pollution in places with poor monitoring",
theme = theme(plot.title = element_text(size =12, face = "bold"),
plot.subtitle = element_text(size = 8)))